Abstract

This study measures the stability of performance exhibited when different
classes learn the same material. By focusing standard measurement techniques on
the item difficulties, i.e. the proportion of students answering an item correctly,
of items common to several classrooms, it was determined that up to two-thirds of
the reliable variance of a classroom test is held in common with identical tests
given in similar classes. The particular wording of the test item measuring a
concept was shown to be a critical factor in knowledge assessment. Classes were
given identical items measuring common concepts and changed items measuring a
different set of common concepts. The correlation between classes of item
difficulties for identical items is approximately .70, whereas the correlation for
changed items is approximately .35. Suggestions are made for utilizing the high
correlation between identical items in instructional decision making.

The Certainty of Information in Instructional
Decision Making

Effective decisions are based on the ability to predict the outcomes of
future events with some degree of success. The decision maker is happiest when
he can predict future events with total certainty, but, in the absence of such
good fortune, he will look for the best statistical advantage allowed by his
available information. For example, the registrar's office of most universities
makes use of the positive relationship (r ≈ .5) between high school grades and
college grades to accept a sample of applicants who will have the best prognosis
for college success. The purpose of this paper is to assess the certainty of
information available to the teacher within his own classroom for his
instructional decision making.

Rosenshine (1970) reviewed studies examining the consistency of teacher
effects in classroom or classroom-like situations and found only nine studies
which attempted to make such a consistency check. The results of these studies
were disappointing in that when student achievement was the dependent variable
very little consistency of effect was demonstrated. These studies tested many
classes taught by many teachers (24 to 106) with a standardized test and
correlated mean student achievement for a given teacher's class with the same mean
in the same class taught at a later time. Thirteen correlations obtained in five
long term studies of this type ranged from -.08 to .53 with a mean of .28. This
approach to assessing classroom data stability has two major disadvantages: (a) it
requires large numbers of teachers and students and (b) it does not provide the
individual teacher with the detailed information needed for instruction improvement
decisions. The remaining studies reviewed assessed consistency in teacher effects
for short (30 min.) lectures. Positive results were shown, but again the magnitude
of consistency was not great.

What may prove to be a better approach to assessing information certainty in
the classroom is suggested by research involving paired-associate (PA) learning.
Coleman (1970) reviewed performance data collected from children given reading
exercises in PA format. The words the children were learning to read were rank
ordered on the dimension of item difficulty. The rank orders from two or more
experiments using the same words were then correlated; 31 of the correlations
reported fell between .69 and .98 while the remaining two were .33 and .31. More
recently, researchers (Atkinson, 1972; Atkinson & Paulson, 1972; Laubsch, 196[?])
have successfully used item difficulties gathered from one group of subjects
to provide the basis for decisions about which PA item to present next in sequence
of instruction experiments. The PA experiments suggest that consistency of
effect in the classroom might be better demonstrated through the use of item
difficulties computed for tests common to several classes.

Item difficulty is a notion quite familiar to educational test and measurement
specialists. However, the concern of educational measurement has in general
been with the reliable assessment of the individual student's knowledge. This
translates into estimation of how accurately the student's total score on a test
reflects the state of his knowledge. Answers to single items are not particularly
reliable estimates of a single examinee's knowledge, and so individual item
statistics are used in constructing the best possible overall test. Suppose
instead that this emphasis were changed to regard the item difficulty, defined
as the proportion of students correctly answering a test item, as the statistic
of major interest. If instruction is delivered under close to constant conditions
and if the same test items are used with successive classes, the product moment
correlation between two classes on an item by item pairing should be quite high.
This correlation of item difficulties can be used as a means of assessing the
stability of instruction efforts in the classroom.
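
The statistic proposed here can be illustrated with a minimal Python sketch. It computes item difficulties (proportion correct) for two classes and then their product moment correlation on an item by item pairing. The response matrices are invented for illustration and the function names are hypothetical; none of this comes from the original study's scoring software.

```python
from math import sqrt


def item_difficulties(responses):
    """Proportion of students answering each item correctly.

    `responses` has one row per student; each row holds 1 (correct)
    or 0 (incorrect) for every item, in the same item order.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students for j in range(n_items)]


def pearson_r(x, y):
    """Product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented data: two classes of four students answering the same four items.
class_a = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
class_b = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
]

difficulty_a = item_difficulties(class_a)  # one proportion per item
difficulty_b = item_difficulties(class_b)
stability = pearson_r(difficulty_a, difficulty_b)
print(difficulty_a, difficulty_b, round(stability, 2))
```

Note that the correlation is taken over items, not over students, so the number of paired observations is the number of common items.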

One goal of education, broadly defined, is the development of a state within
a person called the learner which is similar to an internal state within a person
called the knower. When the learner is in this state, he is said to "understand".
The state of understanding is inferred from behavior in relation to a context,
i.e. a person who emits situationally appropriate behavior _may be_ said to
understand the situation. "May be" is underlined in the previous sentence because
understanding is not inferred from any "particular behavior" (Deese, 1969).
Deese writes, "The criteria for understanding are in the potential for an
indefinite number of appropriate reactions, some linguistic and some not."

In writing a test item to probe the student's ability to react appropriately,
the teacher is constructing one test question from an indefinite set of test
questions. Ideally, the student should give the correct response to any member of
the indefinite question set if he understands the concept being probed, or he should
give all incorrect responses if he does not. In practice, we would expect the
particular wording of a multiple-choice test question to affect the estimate of
the student's competence for at least two reasons: (a) Concepts stored in memory
must be retrieved from storage, and changed wording of a question could conceivably
change the ease of access to the concept needed to answer the question.
(b) Changed wording of response alternatives could affect the difficulty of the
discriminations needed to identify the correct alternative. One way to assess
the impact of specific wordings is by giving two classes learning the same
subject matter identical items and items measuring the same concept with changed
wording and then correlating the resulting data. If the understanding of the
students is key, the correlations of both types of item difficulties (identical
items and changed items) will be the same. If item wording is a major factor,
the correlation for changed items will be lower than the correlation for identical
items.

In addition to correlations between identical test items and between changed
items, the data collected from the classrooms described in the methods section
allow a number of intentional and natural experimental comparisons. Some of the
classes were taught using a workbook specially prepared for the class while
others were not (intentional). In one case, the textbook, which was common to all
the classes, was changed. Many of the teachers involved in these classes lectured
during class periods while others used the classroom primarily for testing and
assisting students with problems. In some of the classes, students were given
multiple-choice test items written by the same professor who wrote the
multiple-choice items of a common final examination, while in other classes the
students were given essay and problem quizzes designed by a different instructor
before receiving the common multiple-choice final. Data on these comparisons are
included with the correlation data in the results section.

Methods

Description of classes and subjects. At the University of Washington the
introductory FORTRAN IV computer programming classes are handled by the general
engineering department. Engineering 141, as the course is labeled, has 10 to 12
sections each quarter with between 15 and 30 students in each section. The
sections usually have roughly equal percentages of upper and lower classmen. The
students are drawn from the general university population, but there does tend
to be a larger number of engineering students in each section than would be
expected from a random sample of the student body. The course is a four-credit
course which normally meets for four one-hour periods per week, but on occasion
it meets for two two-hour sessions. Students in all of the sections are given
access to the University CDC 6400 computer in order to test and run their
practice programs.

Course reading materials. All classes involved in the data collection from
Autumn Quarter, 1974, and Winter Quarter, 1975, used a common textbook,
Fortran IV Programming by Rule, Finkenaur, and Patrick (1973). Data were
collected from a single course in the Spring of 1975 and that class used a
different text, Fundamentals of Fortran Programming by Nickerson (1975). In
addition to the textbook, three Autumn classes and the one Spring class used a
workbook prepared locally by Professor W. Dunn of the Civil Engineering department.
The workbook has 13 sections corresponding to topics in Fortran programming, e.g.
DO loops and subscripted variables. Each section has two types of problems,
short-answer essay questions and multiple-choice questions, and in addition, many
of the sections have matching exercises. Answers are included for all of the
questions.

Test items. Three classes from Autumn Quarter and three classes from
Winter Quarter were given weekly quizzes (13 to 30 items) from the second through
the ninth week of the quarter. The quizzes given to the three classes during
the same week tested the same concepts, sometimes with identical multiple-choice
items and sometimes with changed multiple-choice items. An item was considered
identical if the wording of the question stem remained unchanged between two
classes and if the wording of the four response alternatives was unchanged;
reordering of the response alternatives was allowed under the identical condition.
Changed items had at least one word changed in the question stem, the response
alternatives, or in both stem and alternatives. Problems having the same words
but new numbers were considered changed items.
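
The identical/changed rule can be sketched as a small Python function. This is only an illustration of the stated rule, assuming each item is stored as a stem string plus a list of four alternative strings; the function name and the sample item are hypothetical and do not come from the course quizzes.

```python
def classify_item(stem_a, alternatives_a, stem_b, alternatives_b):
    """Label a pair of item versions as "identical" or "changed"."""
    same_stem = stem_a == stem_b
    # Reordering is allowed under the identical condition, so the
    # alternatives are compared without regard to their order.
    same_alternatives = sorted(alternatives_a) == sorted(alternatives_b)
    return "identical" if (same_stem and same_alternatives) else "changed"


# Invented item, for illustration only.
stem = "What is the value of K after K = 7 / 2?"
alternatives = ["2", "3", "3.5", "4"]

# Same wording, alternatives presented in a different order.
reordered = classify_item(stem, alternatives, stem, ["4", "3.5", "3", "2"])

# Same words but new numbers, which the rule treats as a changed item.
new_numbers = classify_item(
    stem,
    alternatives,
    "What is the value of K after K = 9 / 2?",
    ["3", "4", "4.5", "5"],
)
print(reordered, new_numbers)
```

Exact string comparison is the strictest reading of "at least one word changed"; a real item bank might first normalize spacing or capitalization, which is a design choice the text does not specify.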

The items from all of the weekly quizzes were written by Professor Dunn,
as were the test items used for the final examinations. Five sections of
Engineering 141 were given a common, 44-item final examination at the end of the
Autumn Quarter and eight sections were given a common, 54-item final at the
end of the Winter Quarter. All tests were machine scored at the University of
Washington Educational Assessment Center. The computer printout of the scoring
includes an item by item analysis which gives the proportion of students making
the correct response to an item.



The multiple-choice questions of the workbook were a parallel form of the
weekly quizzes. The same concepts were tested on weekly quizzes as were covered
by the workbook quiz, with items which were in the majority of cases (55%) identical
to those of the workbook. Except for a small number of items included in the
Spring Quarter final examination, none of the items from the final examinations
were identical to items given during the quarter.

Teaching methods, Autumn. Three classes during the Autumn Quarter used the
same textbook, the same workbook, and parallel forms of the weekly quizzes. All
three of these classes were taught using a semi-mastery instruction method which
allowed each student scoring below 90% on the weekly quiz the first time it was
given to retake a parallel form of the quiz. The student was allowed to study
his first test results to determine his errors before taking the second quiz;
[illegible] were scheduled for the first and second testing sessions during a
week at the same time. Mastery instruction typically allows self-pacing, hence
the use of the term "semi-mastery" in describing the method. Class time was used
to handle details of course administration and to answer student questions on an
individual basis. Very little lecturing was done in these classes. The two
additional classes given the common final in the Autumn Quarter were taught more
traditionally with lectures during class and single-try test sessions.

Teaching methods, Winter. During the Winter Quarter three instructors were
again compared on the weekly quizzes; two instructors used the semi-mastery
method and the third instructor adopted a lecture approach. This third instructor
placed special emphasis on structured programming (Dijkstra, 1973) in the hope
of improving the programming skills of his students. Five additional instructors
used the final test; their instructional methods are best described as
traditional lecture. Some of these classes were given weekly quizzes composed of
programming problems designed by the instructor of the section.

Results 

The primary data reported are the correlations of item difficulties among 
classes for identical items and the correlations among classes for changed items. 
The reader should bear in mind that the items contained in the identical set and 
the changed set are not the same for each correlation reported, e.g. the identical 
item set between class one and class two does not match the identical item set 
between class one and class three. In some cases items were discarded from the 
tests by instructors in one or more of the sections because of dissatisfaction
with the items; all item discards were made before the tests were scored. Each
correlation reported is followed by the number of items included in the
correlation, e.g. .76 (33). Note that the number in parentheses is the number of
items included in the comparison and is not the number of subjects used in
computing the item difficulties. The number of subjects used to determine item
difficulty is always between 15 and 30.

The measurement theorist usually begins from a two-dimensional data matrix
in which one dimension is a listing of the individual subjects and the second
dimension is a listing of the test items. Each subject-item cell in the matrix
is filled with a one if that subject responded to that item correctly and with a
zero if an incorrect response occurred. The formulas derived from test theory
for the manipulation of this data matrix are designed to estimate the reliability
of the test in measuring the student's knowledge. Throughout the results section
there is a shift from this perspective. In the standard approach the test items
are seen as measuring the student; in the analysis performed here the students
as a class are seen as measuring the difficulty of the test items. The same data
matrix is used in the shifted perspective, but the formulas used in computation
with the data are analogs of the standard formulas. For example, coefficient
alpha, in the case of dichotomous items, takes the following form (Nunnally, 1967):



$$r_{kk} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma_s^2}\right)$$

where p is the proportion of students getting an item correct, q is the proportion
getting the same item incorrect, k is the number of items, and $\sigma_s^2$ is the
variance of the subjects' total scores. Coefficient alpha is computed as follows
under the changed perspective:



$$r_{nn} = \frac{n}{n-1}\left(1 - \frac{\sum_{j=1}^{n} c_j e_j}{\sigma_i^2}\right) \qquad (1)$$

where n is the number of subjects, c is the proportion of the items a subject
correctly answers, e is the proportion of the items the same subject answers
incorrectly (e = 1 - c), and $\sigma_i^2$ is the variance of the total scores of the
items. The second form of coefficient alpha is a measure of the reliability of
the item difficulty estimates within a single class.
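
The two forms of coefficient alpha described above can be sketched in Python as follows. The sketch assumes a 0/1 response matrix with one row per subject and uses the population variance (division by N); both the tiny data matrix and the function names are hypothetical. Under these assumptions the shifted form is simply the standard formula applied to the transposed matrix, so that subjects play the role of items.

```python
def variance(values):
    """Population variance (divide by N); an assumption of this sketch."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def alpha_standard(responses):
    """Standard form: items are seen as measuring the subjects."""
    n_rows = len(responses)
    k = len(responses[0])
    # p_j: proportion of rows scoring 1 in column j.
    p = [sum(row[j] for row in responses) / n_rows for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    row_totals = [sum(row) for row in responses]
    return (k / (k - 1)) * (1 - sum_pq / variance(row_totals))


def alpha_shifted(responses):
    """Shifted form: the class is seen as measuring the item difficulties."""
    # Transposing makes each row an item and each column a subject, so
    # c_j and the variance of the item totals fall out of the same formula.
    transposed = [list(column) for column in zip(*responses)]
    return alpha_standard(transposed)


# Invented matrix: four subjects (rows) by three items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

alpha_items = alpha_standard(responses)
alpha_subjects = alpha_shifted(responses)
print(round(alpha_items, 3), round(alpha_subjects, 3))
```

With so few subjects and items the two values are only illustrative; the point is that one data matrix yields both coefficients.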

The item difficulty correlations obtained during the Autumn Quarter among
the three semi-mastery instruction sessions are shown in Table 1. The mean
correlations from Table 1 are .65 for identical items and .33 for changed
items. The mean difficulty of the items, the standard deviations of the items,
and the dependent t test values between the classes compared are shown in
Table 2. Three t tests are reported instead of one analysis of variance because
the item sets vary from comparison to comparison. Note that the mean item
difficulty from the test items tends to be high (approximately 85%). The range
of item difficulties is restricted and the correlations reported in Table 1 may
underestimate the magnitude of relationship that actually exists between classes
(Minium, 1970, p. 190). Data from the final examination given to five sections
of the programming class in the Autumn Quarter are shown in Table 3. The mean
correlation from Table 3 is .73. See Table 4 for the mean item difficulties
and standard deviations of the five classes. An items X classes repeated
measures analysis of variance done for the 39 items of the final examination that
all classes answered shows significant variability among the classes
(F(4,152) = [illegible], p < .001). Orthogonal contrasts show the mean item
difficulty of class three to be significantly greater than the mean item difficulty
of class two (F(1,152) = [illegible], p < .05). Class five and class four also show
a significant difference (F(1,152) = 9.95, p < .01). Any interpretation of the
significant orthogonal contrasts in terms of instruction received is confounded
by the facts that class three contained 80 percent upperclassmen whereas the
normal class contains approximately 50 percent upperclassmen and that class five
was told in advance that scores on the final examination would not be included
in calculations of their course grade.
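
A dependent t test on item difficulties pairs the two classes item by item. The following Python sketch shows the computation under that reading; the difficulty values are invented, and the function name is hypothetical rather than taken from the study.

```python
from math import sqrt


def dependent_t(first, second):
    """Paired t statistic and degrees of freedom for two classes.

    `first` and `second` hold the difficulties (percent correct) of the
    same items, in the same order, for the two classes being compared.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (divide by n - 1).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    t = mean_diff / sqrt(var_diff / n)
    return t, n - 1


# Invented item difficulties for four items common to two classes.
first_class = [90.0, 80.0, 70.0, 60.0]
second_class = [85.0, 80.0, 60.0, 55.0]

t_value, df = dependent_t(first_class, second_class)
print(round(t_value, 2), df)
```

Because the pairing is over items, the degrees of freedom depend on the number of common items rather than on the number of students.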

During the winter quarter, comparisons were made among two semi-mastery
courses and a third course which emphasized the structured approach to program
writing (see Tables 5 and 6 for the data from these comparisons). The average
correlation among the three classes is .70 for identical items and .39 for
changed items. The correlations from the winter quarter replicate the autumn
quarter results. Table 7 shows the intercorrelations of eight classes taking
Table 1

Item difficulty correlations of three semi-mastery courses
for identical and changed items

Class                  Class Number
Number         1          2           3

Identical items
1              -       .68(78)*    .70(47)
2                         -        .57(47)
3                                     -

Changed items
1              -       .44(45)     .34(28)
2                         -        .21(28)
3                                     -

*The number in parentheses is the number of test items used
in calculating the correlation.
Table 2

Mean item difficulties, standard deviations, and t test values
for comparisons among three semi-mastery classes

Classes        First      Second
Compared       Mean       Mean        Sd1       Sd2        t

Identical items
1 & 2          86.09      85.55      12.16      --         --
1 & 3          84.19      86.83      12.99      --         --
2 & 3          85.70      86.83      12.32      --         --

Changed items
1 & 2          77.11      80.07      16.53      --       -1.20
1 & 3          79.4[?]    90.14      15.73      --       -3.77*
2 & 3          83.04      90.14      13.62      --       -2.64

*p < .01

Note. Entries marked -- are not legible in the source. One Sd2 value for the
changed items (7.78) survives, but its row cannot be determined.
Table 3

Item difficulty correlations from the common, autumn quarter
final examination

Class                          Class Number
Number       1        2           3          4          5

1            -     .76(44)*    .86(41)    .79(42)    .68(44)
2                     -        .73(41)    .71(42)    .64(44)
3                                 -       .71(39)    .65(41)
4                                            -       .77(42)
5                                                       -

Note. Classes one, two, and three in this table are the same
as classes one, two, and three of Table 1.

*The number in parentheses is the number of test items used in
calculating the correlation.
Table 4

Final examination mean item difficulties and standard deviations
for five autumn classes

Class
Number        Mean       Standard Deviation

1             67.86      23.1[?]
2             70.09      19.79
3             75.61      19.[?]9
4             68.29      24.02
5             59.89      24.68

Note. Classes one, two, and three in this table are the same
as classes one, two, and three of Table 1.
an identical, 54-item final examination in the winter quarter. The mean
correlation from Table 7 is .71; this value is very close to the value (.73) obtained
for the fall classes. An items X classes repeated measures analysis of variance
done with seven of the classes showed significant variability among the class
means (F(6,307) = 3.35, p < .01). Item difficulties were not present for 11
of the cells in the data matrix; their values were determined using a missing
data estimation procedure recommended by Myers (1966, p. 171). Class eight,
the structured programming section, had 13 missing item difficulties. Since
the mean of this section was near the grand mean of all sections, the decision
to exclude this section from the analysis because of the missing data probably
produces a slight inflation of the F statistic (the between means variance
estimate is high). An orthogonal contrast of the semi-mastery instruction
sections one (mean item difficulty = 70.13) and three (mean item difficulty =
63.58) shows a significant difference (F(1,307) = 7.75, p < .01), as does a contrast
of the high and low traditionally taught sections (F(1,307) = 6.76, p < .01). An
orthogonal comparison of semi-mastery classes and traditional classes shows no
significant difference associated with the type of class (F(1,307) = 1.93, p < .10).
Table 8 shows the means and standard deviations of the eight classes taking the
winter final examinations.

The correlations between different sections of Engineering 141 should be
compared with the values of coefficient alpha for the sections (see equation 1).
These coefficients indicate the reliability of item difficulty within each
section and represent the maximum correlation that could be expected between
sections. Table 9 presents coefficient alpha for the eight winter quarter classes.
The average correlation (.71) from the intercorrelation matrix should be compared
with the average value of coefficient alpha ($r_{nn}$ = .86) instead of with the
maximum possible product moment correlation, i.e. 1. The square of $r_{nn}$, when
multiplied by 100, is an estimate of the percent of the total variance
within the classes that is reliably measured by the test instruments. The
reliably measured variance accounts for 74% of the total variance whereas the
variance common to the classes is approximately 50% of the total variance. A
combination of these two figures suggests that up to two-thirds of the reliable
variance from the measuring instruments is common to the eight classes.
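
The arithmetic behind the 74%, 50%, and two-thirds figures can be retraced in a few lines of Python. This follows the text's own procedure of squaring the average coefficients; the variable names are hypothetical, and the two input values are the averages reported for the winter final examination.

```python
mean_alpha = 0.86            # average coefficient alpha across the eight classes
mean_between_class_r = 0.71  # average correlation between classes

# Squares of the two averages, read as shares of the total variance.
reliable_share = mean_alpha ** 2
common_share = mean_between_class_r ** 2

# Share of the reliably measured variance that the classes hold in common.
common_of_reliable = common_share / reliable_share

print(round(reliable_share, 2), round(common_share, 2), round(common_of_reliable, 2))
```

The ratio comes out near .68, which is the basis for the phrase "up to two-thirds".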

One instructor was followed through the autumn, winter, and spring quarters.
Table 5

Correlations of item difficulties among two semi-mastery (SM)
and one structured programming (SP) classes

                         Class
Class         SM1        SM2          SP

Identical items
SM1            -       .77(74)*    .[??](65)
SM2                       -        .59(64)
SP                                    -

Changed items
SM1            -       .47(29)     .34(32)
SM2                       -        .37(26)
SP                                    -

*The number in parentheses is the number of test items used
in calculating the correlation.
Table 6

Mean item difficulties, standard deviations, and dependent
t test values for comparisons among two semi-mastery (SM)
and one structured programming (SP) classes

Classes         First      Second
Compared        Mean       Mean        Sd1         Sd2         t

Identical items
SM1 & SM2       81.97      72.78      15.72       19.30       6.34**
SM1 & SP        82.18      77.60      15.94       21.09       2.61*
SM2 & SP        74.69      79.63      18.83       18.64      -2.34

Changed items
SM1 & SM2       81.57      73.97      [?]6.66     21.91       2.02
SM1 & SP        80.13      77.25      17.59       19.57       [?].76
SM2 & SP        74.88      73.69      19.[?]9     20.78        .27

*p < .05
**p < .01




Table 7

Correlations of item difficulties from a Winter Quarter final
examination given to eight classes

Class                                        Class Number
Number    1       2          3          4          5          6           7          8

1         -    .72(54)*   .68(48)    .70(52)    .81(54)    .7[?](51)   .76(54)    .73(41)
2                 -       .69(48)    .71(52)    .78(54)    .77(51)     .79(54)    .68(41)
3                            -       .66(47)    .73(48)    .77(47)     .73(48)    .60(39)
4                                       -       .66(52)    .64(49)     .57(52)    .70(40)
5                                                  -       .86(51)     .83(54)    .61(41)
6                                                             -        .81(51)    .61(41)
7                                                                         -       .55(41)
8                                                                                    -

Note. Classes one, three, and eight are identical to classes SM1,
SM2, and SP in Table 5.

*The number in parentheses is the number of test items used in
calculating the correlation.
Table 8

Final examination mean item difficulties and standard
deviations for eight winter classes

Class
Number        Mean        Standard Deviation

1             70.13       20.30
2             68.33       21.99
3             63.59       22.05
4             66.25       24.69
5             66.[?]8     23.6[?]
6             61.96       22.35
7             62.28       26.93
8             64.80       24.63

Note. Classes one, three, and eight are identical to classes
SM1, SM2, and SP in Table 5.






Table 9

Coefficient alpha for eight Winter Quarter classes

Class         Coefficient      Number of
Number        Alpha            Students

1             .78              19
2             .83              21
3             .86              25
4             .86              20
5             .88              24
6             .86              25
7             .92              22
8             .89              23




He used the semi-mastery method of instruction each quarter, but varied the
written materials given to the students. Written material in the fall included
the Rule et al. (1973) text and the workbook; in the Winter Quarter the workbook
was removed, and in the Spring Quarter the workbook was reintroduced along
with a change in textbook (Nickerson, 1975). The fall-winter correlation for
identical items is .61(96) and for changed items is .71(49). The comparisons
for winter-spring are .24(54) and .09(59). The low correlations for winter-spring
are due primarily to the extremely high spring test scores and their
consequent lack of variability. A dependent t test comparing autumn and winter
results showed no significant difference between the identical item means
(autumn mean = 82.58 and winter mean = 81.40; t = .82) and a similar test showed
no significant difference between the changed item means (autumn mean = 76.96
and winter mean = 75.92; t = .47). These same comparisons were significant
between the winter and spring quarters (winter identical = 85.40 and spring
identical = 96.95; t = -6.56, p < .01; winter changed = 75.71 and spring
changed = 85.06; t = -2.99, p < .01). Comparison of items common to the
fall and winter final examinations shows a correlation of .89(18) and the same
comparison for winter-spring shows a correlation of .45(37). For final
examinations no significant difference was found between the fall mean (72.05) and
the winter mean (68.00) (t = 1.45, p < .10) or between the winter mean (68.64)
and spring mean (69.89) (t = -0.34). In short, even though the performance
of spring quarter students differed from the performance of winter quarter
students during the quarter, the changes made within the instructor's classes
did not affect the mean performance of students in different quarters on common
final examination items.

Discussion 

The high alpha values found in this study can be interpreted as indicating
that within a class item difficulty is a very reliable measure. To put it in
a more important way, if an item is relatively difficult for one student, it
is likely to be difficult for other students. The high correlations resulting
from pairing item difficulties from identical item sets clearly indicate a
high degree of stability among the classes surveyed. The average correlations
of .73 and .71 from the classes taking the autumn and winter final examinations
mean that approximately 50% of the performance variance of one class's scores
can be predicted if an item difficulty analysis is available from another class's
performance on those same items. Note that this statement holds true when many
variables normally thought to influence instruction are ignored. The
semi-mastery classes, in addition to a common teaching method, used common learning
materials, i.e. same text, same tests during the quarter, same workbook
(autumn only), yet the correlations between the semi-mastery classes are not
different from the correlations among classes having the textbook as the only
common reading source. The correlations of the semi-mastery classes with
traditional classes are not different from the correlations among semi-mastery
courses themselves. The students in all of these classes were faced with the
problem of extracting information about computer programming from written or
verbal statements, and they seem to have solved this problem in the same way or
at least with the same degree of success in each of the classes. Teacher
personality, method of instruction, classroom environment, and any other
variables present but unmeasured and unrecognized did not substantially affect the
learning of the students. The classes were either constant with respect to
such variables and hence equally affected, or the variables do not have a major
effect on student performance.

Data gathered from item difficulties collected during the fall and winter
quarters support the conclusion of high stability between classes when
correlations of identical items are used to assess stability (mean r = .65 for
autumn quarter and mean r = .70 for winter quarter). However, altering the wording
of the test questions used to probe the same students' knowledge of programming
concepts substantially lowers the correlation found between classes (mean r = .33
for autumn and mean r = .39 for winter). The assessment of the students' knowledge
is related to the particular wording of the test question used to probe that
knowledge. On the other hand, the positive correlation that remains after wording
changes suggests that item difficulty measures of a common concept will show
consistency when compared to the variability of estimates made for different
concepts.

The data support the conclusion that treatments aimed at the entire set of 
concepts the student was to acquire were not effective. The semi-mastery versus 
traditional instruction comparison, the workbook versus no-workbook comparison, 
and the structured programming versus traditional programming comparison all 
failed to produce significant differences between classes on the final 
examinations. This finding is in agreement with a general tendency to find null 
results in such comparisons 
(e.g., Dubin & Taveggia, 1968; Getzels & Jackson, 1963; Stevens, 1967; Wallen & 
Travers, 1963). The treatments used in classrooms generally do not alter the 
learning of the students in ways that are detectable in their performance. 

The high correlations found between classes present an alternative to the 
approach of attempting to affect the learning of the entire set of concepts. 
Since we know a large number of test questions will be readily answered, why not 
focus the treatment where we know the students will have trouble answering 
questions? We might, for example, provide the student with a workbook which 
contains brief explanations and practice problems for concepts we know (from 
prior data collection) the student is likely to have trouble mastering. Problems 
related to readily learned concepts would be left out of the workbook entirely. 
Such a tactic may not change the student's learning strategy, but the selective 
application may influence the student's allocation of effort. What is being 
recommended here is the systematic selection of treatment focal points from 
objective data collection. 
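
A minimal sketch of such a selection follows, assuming item difficulties 
(proportions correct) are available from a prior quarter. The concept labels, 
the values, and the cutoff of .50 are all hypothetical; the paper does not 
specify a threshold. 

```python
def select_focal_items(prior_difficulties, threshold=0.5):
    """Return item labels whose proportion correct is below threshold.

    prior_difficulties: dict mapping item label -> proportion correct
    observed in an earlier offering of the course.
    The result is ordered from hardest (lowest proportion) to easiest.
    """
    hard = [(p, item) for item, p in prior_difficulties.items()
            if p < threshold]
    return [item for p, item in sorted(hard)]


# Hypothetical prior-quarter item difficulties (not data from the study).
prior = {
    "DO loop bounds": 0.35,
    "integer division": 0.48,
    "assignment": 0.92,
    "FORMAT statement": 0.41,
    "READ statement": 0.88,
}

# Items that would receive workbook explanations and practice problems.
focal = select_focal_items(prior)
```

Readily answered items ("assignment" and "READ statement" in this invented 
example) fall above the cutoff and would be left out of the workbook. 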

If we accept ^jOee^^'s notion that understanding leads to appropriate behavior 
in response to an indefinite set of related situations, accept the high 
correlations of item difficulties from identical items given in different 
classes as an indication of the stability of the item difficulty measures, and 
accept the premise that low item difficulties (i.e., a small proportion of 
students answering correctly) indicate a misunderstanding on the part of several 
class members, we are led to some direct conclusions about instructional 
improvement. Ideal understanding of a concept would lead to an appropriate 
response on the part of all students to all items from the indefinite set for 
the concept. A low item difficulty on any item from the set indicates less than 
perfect understanding even though the remaining items from the set might be 
answered correctly. We are thus justified in modifying instruction on the basis 
of information collected from a single, specific test item. If an improvement 
is registered in a subsequent quarter on an item receiving focused attention, 
we would then reword that item for the next teaching of the class to ensure that 
other members of the indefinite set are also favorably affected. In other words, 
an attempt is made through changes in the course materials to selectively shape 
student test performance, but the possibility of shaping being confined to exact 
test item wording is avoided by changing item wording and reassessing the 
understandings drawn by the students in a subsequent version of the course. 

The study reported here has several limitations which deserve attention. 
(a) The data were gathered from courses teaching Fortran programming; there is 
no guarantee that the data will be duplicated in courses of a different type, 
e.g., social science courses. (b) The item difficulties were gathered with a 
single type of test question, i.e., multiple choice; no check was made to 
determine if other testing modes will produce similar results. (c) There is a 
need to follow more instructors from quarter to quarter, particularly in view of 
the failure of within course data to replicate between course data (identical 
item r = .61 and changed item r = .71). (d) No satisfactory explanation is 
offered for the significant variability found within instructional methods. 
(e) And finally, the use of repeated measures items X classes analysis of 
variance assumes a random sampling from a normally distributed pool of item 
difficulties, which in fact did not occur. This same criticism is, however, also 
true of many subjects X treatments analyses, particularly when students from a 
class are treated as randomly assigned to the class. 

The stability of item difficulties from quarter to quarter and class to 
class opens new possibilities for educational research. Since test items can 
be transferred from class to class, class comparisons can be matched on an item 
by item basis to provide more sensitive comparisons via dependent t tests and 
repeated measures designs. Since difficult items can be reliably identified, 
selective strategies which specifically focus on difficult items can be 
attempted. Given the difficulty of establishing adequate control in classroom 
research, the potential of stable item difficulties for the production of more 
sensitive measurement is welcome. 
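
The item-matched comparison mentioned above can be sketched as a dependent 
(paired) t test in which the item, rather than the student, is the unit of 
analysis. The difficulties below are hypothetical values for five identical 
items given in a treated and a comparison class; they are not results from 
this study, and the function name is an assumption. 

```python
from math import sqrt


def paired_t(x, y):
    """Dependent t statistic and degrees of freedom for matched items.

    x, y: item difficulties for the same items in two classes, in the
    same item order. Assumes the differences are not all identical
    (otherwise the variance is zero and t is undefined).
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)  # sample variance
    return mean_d / sqrt(var_d / n), n - 1


# Hypothetical proportions correct on five identical items.
treated = [0.80, 0.65, 0.70, 0.90, 0.55]
control = [0.70, 0.60, 0.60, 0.85, 0.50]

t, df = paired_t(treated, control)
```

Because each difference is taken within an item, the stable between-item 
variation in difficulty drops out of the error term, which is the source of 
the added sensitivity. 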
Footnote 

The author wishes to thank Dr. Gerald Gillmore for his critique of an 
earlier draft of this paper. Special thanks also go to the engineering 
professors who had their classes participate in this study. Professor W. Dunn 
was largely responsible for gaining the cooperation of the engineering 
faculty as well as being responsible for the preparation of the written 
materials used in this study. Everyone who has occasion to do classroom 
evaluation research should be blessed with such a willing ally. 